Extracting data from a complicated pdf file

Spaces between numbers and ASCII characters

R
text extraction
research
East Asia
dataset
Author

Kadir Jun Ayhan

Published

Monday, August 14, 2023

The problem(s)

I was helping a Ph.D. student with her paper on China’s outward foreign direct investment (OFDI). I had been using the OFDI data she gave me, only to realize that some of it had been extracted incorrectly in the first place. For example, some numbers appeared as “214.121.94” instead of “214,121.94”.

So I decided to extract the data from the pdf file myself, but it was not an easy task given the many hurdles along the way. I was finally able to pull it off with the help of colleagues on Stack Overflow and ChatGPT. I adapted the original code for extracting data from a pdf file from Rafael A. Irizarry’s Introduction to Data Science: Data Analysis and Prediction Algorithms with R, especially Chapter 24: String Processing.

The data is from the Chinese Ministry of Commerce.

What made the data extraction challenging in this instance were mainly two things:

  1. The numbers were written with spaces between the digits. For example, instead of 64219 it was 64 219 (not even 64,219).

  2. Worse, the numbers were written not with standard ASCII digits but with fullwidth Unicode characters, which took me some time to figure out. The quick check below makes the difference concrete.
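Base R’s utf8ToInt() shows what is going on: the characters that look like digits in the extracted text have code points far above the ASCII range (a quick check, with fullwidth digits typed directly):

Code
utf8ToInt("１２３")  # fullwidth digits: 65297 65298 65299
utf8ToInt("123")     # ASCII digits:     49 50 51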

And there were some smaller problems along the way. So, here is my journey of extracting data from Chinese official sources and converting it, in the hope of helping others who come across similar problems, and also to keep a record of what I did for my future self.

Loading Libraries
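First, the packages used throughout this post (this list is reconstructed from the functions used below): pdftools for pdf_text(), stringr for the string helpers, dplyr and magrittr for the pipelines, tibble for view(), and writexl to export the result at the end.

Code
library(pdftools)  # pdf_text()
library(stringr)   # str_trim(), str_split(), str_replace_all()
library(dplyr)     # mutate_all(), filter(), mutate()
library(magrittr)  # the %<>% assignment pipe
library(tibble)    # view()
library(writexl)   # write_xlsx()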

You can either download the pdf file and load it from your directory, or you can download it temporarily from the internet and work on it without keeping a local copy.

File

Code
#file <- tempfile()
#url <- "http://images.mofcom.gov.cn/fec/202211/20221118091910924.pdf"
#download.file(url, file, headers = c("User-Agent" = "My Custom User Agent"))


file <- "2021 Statistical Bulletin of China's outward foreign direct investment.pdf"

Now, we will extract all the text from the pdf into R:

Code
pdf_data <- pdf_text(file)

We will need some functions later. I define them here with some explanations.

Functions

This one replaces all spaces and commas with an empty string, so that I can later parse numbers from these strings.

Code
replace_spaces_and_commas <- function(x) {
  str_replace_all(x, "[ ,]", "")
}
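For example:

Code
replace_spaces_and_commas("64 219")      # "64219"
replace_spaces_and_commas("214,121.94")  # "214121.94"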

The country/region name is in the first column. For some reason, for some rows the data values are not on the same row as the country/region name but spill over to the next row, starting from the first column. The following function moves such spilled entries back into the previous row’s second to tenth columns; a toy example follows the definition.

Code
# Some entries spill over to the next row; move them back.

fix_data_alignment <- function(df) {
  for (i in 1:(nrow(df) - 1)) {
    # If column 2 of this row is empty, the row's values landed on the
    # next row: pull that row's columns 1:9 into this row's columns 2:10.
    if (df[i, 2] == "") df[i, 2:10] <- df[i + 1, 1:9]
  }
  return(df)
}
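A toy example (made-up values, not from the data) shows what it does:

Code
toy <- data.frame(matrix("", nrow = 2, ncol = 10), stringsAsFactors = FALSE)
toy[1, 1] <- "CountryA"           # the country name sits on its own line...
toy[2, 1:9] <- as.character(1:9)  # ...and its nine values spilled to the next
fix_data_alignment(toy)[1, ]      # row 1 now holds the name plus all nine values

The leftover spilled row still has an empty tenth column, which is how such rows get filtered out further below.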

The following was the trickiest part to get working. The numbers in the columns were extracted as follows (for example): －１２２, ２９４５８, ９.

If you ask R whether these are the same as -122, 29458, 9, respectively, the answer is no.

Code
"－１２２" == "-122"
[1] FALSE

I had no idea why. I asked on Stack Overflow, and Andre Wilberg suggested the following:

A base R approach using utf8ToInt to get an integer from the utf8 code, then subtracting 65248 from integers that are in a specific range (> 126, see ascii table) to get the desired number, finally bringing the integer back to utf8 with intToUtf8.

Code
as.numeric(
  unlist(strsplit(paste(
    sapply(unlist(strsplit("－１２２, ２９４５８, ９", "")), \(x)
      ifelse(utf8ToInt(x) > 126, intToUtf8(utf8ToInt(x) - 65248), x)),
    collapse=""), ", ")))
[1]  -122 29458     9

I got some help from ChatGPT (many back-and-forth exchanges to make this work) to turn this into a function.

Code
convert_fullwidth_to_numeric <- function(input_str) {
  utf8_codes <- utf8ToInt(input_str)
  
  # Handle the fullwidth minus sign (U+FF0D) separately: map it to the
  # ASCII hyphen-minus (code 45)
  utf8_codes <- ifelse(utf8_codes == 65293, 45, utf8_codes)
  
  # Fullwidth digits ０-９ occupy U+FF10 to U+FF19 (65296 to 65305);
  # subtracting 65248 maps them onto the ASCII digits 0-9
  converted_utf8_codes <- ifelse(utf8_codes >= 65296 & utf8_codes <= 65305, utf8_codes - 65248, utf8_codes)
  converted_chars <- intToUtf8(converted_utf8_codes)
  converted_numeric <- as.numeric(converted_chars)
  return(converted_numeric)
}
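Applied to the problematic values from above, the function returns proper numbers:

Code
convert_fullwidth_to_numeric("－１２２")   # -122
convert_fullwidth_to_numeric("２９４５８")  # 29458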

Separate pages

Now, the pages that I need are 53 to 64 (the pdf page numbers, not the page numbers printed in the document). So I create a new variable with only these pages and split each page into lines, which gives a list of 12 pages. Using a for loop, I assign each page to its own object, named after the page’s position in the table I am interested in (tab_pdf_1, tab_pdf_2, etc.).

Code
pdf <- pdf_data[53:64]

tab_pdf <- str_split(pdf, "\n")

for (i in 1:12) {
  assign(paste0("tab_pdf_", i), tab_pdf[[i]])
}
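As an aside, assign() and get() work, but keeping the pages in a single named list would avoid creating 12 separate objects; a minimal alternative (tab_pages is an illustrative name, not used in the rest of the post):

Code
tab_pages <- setNames(tab_pdf, paste0("tab_pdf_", seq_along(tab_pdf)))
tab_pages[["tab_pdf_1"]]  # same content as the tab_pdf_1 object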

I assign the column names:

Code
the_names <- c("country", "year_2013", "year_2014", "year_2015", "year_2016", "year_2017", "year_2018", "year_2019", "year_2020", "year_2021")

Data Cleaning

Now I can begin data cleaning.

I look at each page separately to see on which line the data begins and where it ends, and whether there are any potential errors on the page.

Code
view(tab_pdf_1)
# and do the same for consecutive pages.

From the first page to the sixth, the same code works fine. I explain what the code does with annotations within the code chunk for these pages. I asked ChatGPT to annotate my code; for more, see Rafael A. Irizarry’s Introduction to Data Science: Data Analysis and Prediction Algorithms with R, especially Chapter 24: String Processing.

This cleaning process could have been wrapped in a function to automate it, but I preferred doing it page by page to check that the cleaning worked as expected for each page, since the table design and the possible errors can differ from page to page. A function is still probably the better option; a sketch follows.
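For reference, a minimal sketch of what such a function could look like (clean_page is an illustrative name; it wraps the same pipeline used below, with the line range and the asterisk handling as arguments):

Code
clean_page <- function(lines, rows, strip_asterisk = FALSE) {
  out <- str_trim(lines[rows])
  # Pages 7 onward contain the asterisk operator "∗" (U+2217); see below
  if (strip_asterisk) out <- str_replace_all(out, "∗", "")
  out %>%
    str_replace_all(",", "") %>%
    str_split("\\s{2,}", simplify = TRUE) %>%
    data.frame(stringsAsFactors = FALSE) %>%
    setNames(the_names) %>%
    mutate_all(.funs = replace_spaces_and_commas) %>%
    filter(country != "")
}

# e.g., pdf_clean1 <- clean_page(tab_pdf_1, 14:60)

Here is the page-by-page version, annotated: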

Code
# Select lines 14 to 60 of tab_pdf_1 (the lines that contain the table rows)
pdf_clean1 <- tab_pdf_1[14:60] %>%

  # Trim whitespace from both ends of each element in the selected rows
  str_trim %>%

  # Replace all commas with an empty string in each element
  str_replace_all(",", "") %>%

  # Split each element based on 2 or more consecutive spaces and simplify into a matrix
  str_split("\\s{2,}", simplify = TRUE) %>%

  # Convert the matrix into a data frame with non-factor columns
  data.frame(stringsAsFactors = FALSE) %>%

  # Set column names for the data frame using the provided 'the_names' vector
  setNames(the_names) %>%

  # Apply the 'replace_spaces_and_commas' function to all columns
  mutate_all(.funs = replace_spaces_and_commas) %>%

  # Keep only the rows where the 'country' column is not empty.
  # I do this because the extraction produced many empty rows.
  
  filter(country != "")

# The rest are the same. The only changes are the line/ row numbers.

pdf_clean2 <- tab_pdf_2[6:69] %>%
  str_trim %>%
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>% mutate_all(.funs = replace_spaces_and_commas) %>% filter(country != "") 


pdf_clean3 <- tab_pdf_3[6:69] %>%
  str_trim %>%
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>% mutate_all(.funs = replace_spaces_and_commas) %>% filter(country != "") 


pdf_clean4 <- tab_pdf_4[6:69] %>%
  str_trim %>%
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>% mutate_all(.funs = replace_spaces_and_commas) %>% filter(country != "") 


pdf_clean5 <- tab_pdf_5[6:43] %>%
  str_trim %>%
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>% mutate_all(.funs = replace_spaces_and_commas) %>% filter(country != "") 


pdf_clean6 <- tab_pdf_6[6:44] %>%
  str_trim %>%
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>% mutate_all(.funs = replace_spaces_and_commas) %>% filter(country != "") 

After the 7th page, the table design changes (it reports stocks). There are asterisks in the values: again not the ASCII “*” but the asterisk operator “∗” (U+2217), which was difficult to figure out initially.

Code
pdf_clean7 <- tab_pdf_7[6:71] %>%
  str_trim %>%
  str_replace_all("\\∗", "") %>%  # Remove asterisks
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>% mutate_all(.funs = replace_spaces_and_commas) %>% filter(country != "")

Furthermore, as explained above, some countries’/regions’ data spill over to the next row starting from the first column, so fix_data_alignment moves those entries back into the previous row’s second to tenth columns.

Code
pdf_clean7 <- fix_data_alignment(pdf_clean7)

This leaves unnecessary rows which do not belong to any country/region. I find and filter them out: their last column is an empty string (not NA, just an empty string).

Code
pdf_clean7 %<>% filter(pdf_clean7[,10] != "")

More of the same for the rest of the pages. There was only one minor error: the Hong Kong entry on page 8 could not be extracted with my string-processing code, so I added it manually (that was faster than figuring out how to do it with code).

Code
pdf_clean8 <- tab_pdf_8[6:55] %>%
  str_trim %>%
  str_replace_all("\\∗", "") %>%  # Remove asterisks
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>% mutate_all(.funs = replace_spaces_and_commas) %>% filter(country != "") 


pdf_clean8 <- fix_data_alignment(pdf_clean8)

# This leaves unnecessary rows which have the last columns as empty strings (not NA). Filter them out.

pdf_clean8 %<>% filter(pdf_clean8[,10] != "")

# Hong Kong is gone because it couldn't be extracted with the above code. Let's try getting it.

hk <- c(
  "中国香港",
  "37 709 314",
  "50 991 983",
  "65 685 524",
  "78 074 489",
  "98 126 568",
  "110 039 108",
  "127 535 518",
  "143 853 092",
  "154 965 764"
)

# Creating a data frame for the Hong Kong values and using the same column names.

pdf_clean13 <- data.frame(setNames(data.frame(t(hk)), the_names))

# The manually entered values still contain spaces, which would break the
# numeric conversion later, so strip them here as well
pdf_clean13 %<>% mutate_all(.funs = replace_spaces_and_commas)

pdf_clean9 <- tab_pdf_9[6:74] %>%
  str_trim %>%
  str_replace_all("\\∗", "") %>%  # Remove asterisks
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>% mutate_all(.funs = replace_spaces_and_commas) %>% filter(country != "") 


pdf_clean9 <- fix_data_alignment(pdf_clean9)

pdf_clean9 %<>% filter(pdf_clean9[,10] != "")


pdf_clean10 <- tab_pdf_10[6:53] %>%
  str_trim %>%
  str_replace_all("\\∗", "") %>%  # Remove asterisks
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>% mutate_all(.funs = replace_spaces_and_commas) %>% filter(country != "") 

pdf_clean10 <- fix_data_alignment(pdf_clean10)

pdf_clean10 %<>% filter(pdf_clean10[,10] != "")



pdf_clean11 <- tab_pdf_11[6:76] %>%
  str_trim %>%
  str_replace_all("\\∗", "") %>%  # Remove asterisks
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>% mutate_all(.funs = replace_spaces_and_commas) %>% filter(country != "") 


pdf_clean11 <- fix_data_alignment(pdf_clean11)

pdf_clean11 %<>% filter(pdf_clean11[,10] != "")


pdf_clean12 <- tab_pdf_12[6:72] %>%
  str_trim %>%
  str_replace_all("\\∗", "") %>%  # Remove asterisks
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>% mutate_all(.funs = replace_spaces_and_commas) %>% filter(country != "") 


pdf_clean12 <- fix_data_alignment(pdf_clean12)

pdf_clean12 %<>% filter(pdf_clean12[,10] != "")

Combine all

Now, let’s combine all the data together. I was too lazy to write the data frame names out manually for rbind, so I asked ChatGPT for a loop version, and it delivered the following.

Code
# OFDI first

# Create an empty list to store data frames
df_list <- list()

# Loop through the indices and populate the list with data frames
for (i in 1:6) {
  df_list[[i]] <- get(paste0("pdf_clean", i))
}

# Bind the data frames together using rbind
pdf_clean_all_ofdi <- do.call(rbind, df_list)

pdf_clean_all_ofdi %<>% mutate(type = "OFDI")
# Repeat the same for stocks

# Create an empty list to store data frames
df_list2 <- list()

# Loop through the indices and populate the list with data frames
for (i in 7:13) {
  df_list2[[i - 6]] <- get(paste0("pdf_clean", i))  # i - 6 avoids empty list slots
}

# Bind the data frames together using rbind
pdf_clean_all_stocks <- do.call(rbind, df_list2)


pdf_clean_all_stocks %<>% mutate(type = "Stocks")

pdf_clean_all <- rbind(pdf_clean_all_ofdi, pdf_clean_all_stocks)
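As a side note, base R’s mget() fetches several objects by name at once, so both loops could be replaced with one-liners (equivalent to the loops above, before the type columns are added):

Code
pdf_clean_all_ofdi   <- do.call(rbind, mget(paste0("pdf_clean", 1:6)))
pdf_clean_all_stocks <- do.call(rbind, mget(paste0("pdf_clean", 7:13)))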

Transform the columns into numeric

We have our final data frame, but remember that these values are still characters. Using readr’s parse_number() would not work to convert them into numbers, because the digits in question are not standard ASCII digits but fullwidth Unicode characters. So, I used the function that I explained above within a for loop (again, thanks to ChatGPT for some help) to extract numbers from these characters.

Code
# Apply the function to the specified columns (columns 2 to 10)
columns_to_transform <- 2:10  # Adjust column indices as needed

for (col in columns_to_transform) {
  for (row in 1:nrow(pdf_clean_all)) {
    pdf_clean_all[row, col] <- convert_fullwidth_to_numeric(pdf_clean_all[row, col])
  }
}

# Assigning single cells back into character columns coerces the numbers
# back to character, so make the columns numeric in one final step
pdf_clean_all[columns_to_transform] <- lapply(pdf_clean_all[columns_to_transform], as.numeric)
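Instead of the nested loop, the same transformation could be written column-wise, which avoids the cell-by-cell assignment and yields numeric columns in a single step:

Code
pdf_clean_all[columns_to_transform] <- lapply(
  pdf_clean_all[columns_to_transform],
  function(col) sapply(col, convert_fullwidth_to_numeric)
)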

Country Code

Well, there are some options to translate Chinese into English within R, but I didn’t use them. Another option is writing a .csv file, uploading it to Google Sheets, using its translation function, exporting the csv file, loading it back into R, and using countrycode to get ISO3 codes for the countries. I haven’t done this yet either, but a sketch of the countrycode step is below.
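For completeness, the countrycode step would look something like this once the names are in English (a sketch: the country_en column is hypothetical and does not exist in the data yet):

Code
library(countrycode)
# Hypothetical: assumes the Chinese names were already translated into an
# English country_en column
pdf_clean_all$iso3 <- countrycode(pdf_clean_all$country_en,
                                  origin = "country.name",
                                  destination = "iso3c")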

But this is the easiest problem compared to the hurdles of the data extraction process from this pdf file, which had spaces between the digits and digits that were not digits in the usual sense. The following writes the data to a file.

Write .xlsx

Code
writexl::write_xlsx(pdf_clean_all, "ofdi_chinese_extracted_2013_2021.xlsx")